The Library of Congress regularly updates its Recommended Formats Statement, a helpful quick reference for selecting a stable format when there is an opportunity to choose. If converting data from a proprietary format to an open file format results in some data loss, consider saving both. For less established or proprietary formats, consider recording the type, version, and software used to generate and play the file; this can be included in the metadata or documentation.
These guidelines may also be considered during file format selection:
13. Acquire the highest quality version of media to use for preservation
34. For EPUBs, opt for core media types, as defined by the EPUB specification
If a publication is document-like, exporting and transforming the core intellectual components to an existing standard for full-text publications (e.g. EPUB, TEI, or JATS/BITS XML) is a robust approach. This includes publications that contain multimedia or remote content, since these enhanced features can be managed more easily at scale when the rest of the publication is expressed in a standard form. Existing standards can be validated at scale, support both platform migration and preservation, and may encourage enhanced features to be expressed more consistently so that they work with the document.
These guidelines may also be helpful when considering the export package for a linear publication:
3. Use existing standards for export formats
10. Identify and document the core intellectual components of a work
20. Ensure exports cover all core intellectual components
An excessive number of small metadata files or a complex folder hierarchy within an export package adds complexity to the workflow. Ideally, export processes consolidate metadata into one file per publication, and the folder and file structure is mostly flattened, predictable, and consistently named. Metadata should be fully expressed within the metadata file, not via file and folder names, and should include references to the files being described so that they are easily connected. The complexity of a submitted information package affects how quickly and efficiently a preservation service can convert it to an archival information package. Reducing the number of separate metadata files and folders reduces processing time and can improve long-term stability by simplifying migration, either to a preservation system or to another platform. To the extent that the goal is an automated preservation workflow, the export packages should be consistent across publications.
See also:
22. Use an appropriate metadata serialization within the export package
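As a minimal sketch of the consolidated approach described in this guideline, the following Python function checks that a single per-publication metadata record and the files in a flattened package agree with each other. The `resources` and `file` keys and the file names are hypothetical, chosen for illustration rather than taken from any particular standard.

```python
def check_package(metadata: dict, package_files: set[str]) -> dict[str, list[str]]:
    """Compare files referenced in the metadata with files present in the package."""
    referenced = {resource["file"] for resource in metadata.get("resources", [])}
    return {
        # Described in the metadata but absent from the package.
        "missing": sorted(referenced - package_files),
        # Present in the package but not described in the metadata.
        "undescribed": sorted(package_files - referenced),
    }


# Hypothetical single metadata record for one publication.
metadata = {
    "title": "An Example Monograph",
    "resources": [
        {"file": "book.epub", "type": "publication"},
        {"file": "figure-01.tiff", "type": "image"},
    ],
}

report = check_package(metadata, {"book.epub", "figure-01.tiff", "notes.txt"})
```

A check of this kind could run as the last step of an export process, so that inconsistent packages are caught before they are submitted for preservation.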
In addition to the main text and embedded or supplemental media, other features or content such as annotations, high-quality versions of media, supporting data, and peer reviews may be considered integral to the work in some cases. If so, these resources should be part of the export package so that they can be preserved alongside the publication. Special provisions may need to be made for artifacts that are hosted outside of the platform to include them in the export.
See also:
10. Identify and document the core intellectual components of a work
Each publication should have structured bibliographic metadata associated with it. This should be expressed as a separate file stored adjacent to or within the publication package. When possible, it should be expressed in a standard format such as ONIX, JATS, or Dublin Core. In order to process metadata at scale, the file naming convention, the location of the file relative to the publication, and the format should all remain consistent.
These guidelines may add context when deciding how to format bibliographic metadata and where to store it:
3. Use existing standards when creating metadata
22. Express metadata in an appropriate structured format
30. Add bibliographic metadata to an EPUB
45. Embed bibliographic metadata in a web page
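To illustrate what a separate, structured bibliographic metadata file might contain, here is a small Python sketch that serializes a few Dublin Core elements as XML using the standard library. The `metadata` wrapper element, the function name, and all values (including the identifier) are placeholders; a real export would follow whatever application profile the publisher and preservation service have agreed on.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, list[str]]) -> str:
    """Serialize Dublin Core element names and values as a small XML document."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, values in fields.items():
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
xml_text = dublin_core_record({
    "title": ["An Example Monograph"],
    "creator": ["Doe, Jane"],
    "date": ["2024-01-15"],
    "identifier": ["urn:example:monograph-1"],
})
```

Writing `xml_text` to the same file name and location for every publication keeps the metadata predictable for automated processing.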
When exporting metadata, ensure that the data format used to express it is appropriate for the content. For example, a CSV file will work for very simple metadata, but if the fields contain formatting, values that include new lines, or express specific data types, a CSV export could become unreliable or difficult to process. A structured format such as JSON or XML is generally more appropriate and can be validated for errors more easily.
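The difference can be seen in a short Python sketch that passes the same record through JSON and through CSV. The record and its field names are invented for illustration. JSON returns the original data types, while CSV returns every value as a string, so a consumer has to guess which fields were numbers, booleans, or lists.

```python
import csv
import io
import json

# Hypothetical metadata record with a multi-line value and typed fields.
record = {
    "title": "An Example Monograph",
    "abstract": "First line of the abstract.\nSecond line, with a comma.",
    "page_count": 212,
    "open_access": True,
    "keywords": ["preservation", "publishing"],
}


def to_json(rec: dict) -> str:
    """Serialize a record as human-readable JSON, keeping non-ASCII text as-is."""
    return json.dumps(rec, indent=2, ensure_ascii=False)


def csv_round_trip(rec: dict) -> dict:
    """Write a record as a one-row CSV and read it back."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec))
    writer.writeheader()
    writer.writerow(rec)
    buffer.seek(0)
    return next(csv.DictReader(buffer))


from_json = json.loads(to_json(record))
from_csv = csv_round_trip(record)
```

Here `from_json` is equal to `record`, whereas `from_csv["page_count"]` is the string `"212"` and the keyword list has been flattened into a single string.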
Current publishing platforms can support frequent updates and new versions. These should be expressed clearly in the metadata so that the preserved copies can be properly distinguished from each other. If something has changed, the change should be reflected in the version and date, and, where necessary, new exports should be provided.
These guidelines also relate to versioning:
9. Determine the version of record in your context
31. Assign new identifiers to significant versions of a work
Many publication resources that are supported by modern publishing platforms warrant their own description to ensure they are properly credited, interpreted, and rendered with context in the future. Where possible, include descriptive metadata for each resource. Use an existing standard for guidance on what to include, e.g. Dublin Core.
These guidelines add context to creating metadata for publication resources:
16. Captions for non-text features add meaningful context
22. Express metadata in an appropriate structured format
25. Express the license information in the resource-level metadata
26. Describe connections between resources in the metadata
27. Assign and use unique persistent identifiers for publication resources
When a publisher acquires rights for resources that are part of the publication, these should also include rights pertaining to the preservation of those resources. Express these rights in the metadata in a way that allows a preservation institution to determine what they have permission to preserve and relate them to the relevant material.
These guidelines may also support the creation of license metadata:
8. Clarify the license related to preserving third party web resources
24. Create descriptive metadata for each publication resource
40. Embed license information in the HTML
While developing export processes, attention should be given to describing each resource in the package. If the relationships between the resources are also significant, ensure that these are expressed in the metadata as well. For example, if several data files are dependent on each other, or two items are versions of the same thing, or something should interact with the publication in a specific way, these relationships should be expressed so that they can remain connected in the preserved copy. Ask: what information is needed to restitch the seams between the resources in your package?
See also:
27. Assign persistent identifiers to publication resources; they can help perpetuate connections between resources
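One way to make such relationships explicit is to record them as typed links between resource identifiers and to check that every link resolves within the package. The Python sketch below assumes a hypothetical metadata structure. The relation names `isVersionOf` and `requires` are borrowed from the DCMI Metadata Terms vocabulary, but the `id`, `relations`, `type`, and `target` keys are illustrative only.

```python
def unresolved_relations(resources: list[dict]) -> list[tuple[str, str, str]]:
    """Return (source id, relation, target id) for targets not in the package."""
    known_ids = {resource["id"] for resource in resources}
    return [
        (resource["id"], relation["type"], relation["target"])
        for resource in resources
        for relation in resource.get("relations", [])
        if relation["target"] not in known_ids
    ]


# Hypothetical resource descriptions for one export package.
resources = [
    {"id": "dataset-v1", "file": "data-v1.csv"},
    {
        "id": "dataset-v2",
        "file": "data-v2.csv",
        "relations": [{"type": "isVersionOf", "target": "dataset-v1"}],
    },
    {
        "id": "chart-01",
        "file": "chart-01.html",
        # Points at a resource that is not in the package.
        "relations": [{"type": "requires", "target": "dataset-v3"}],
    },
]

problems = unresolved_relations(resources)
```

In this example the chart depends on a dataset that was never exported, which is exactly the kind of broken seam that is hard to repair after the fact.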
Correct handling of character encoding can make an enormous difference to whether a publication is properly rendered. The encoding should be expressed in the metadata and/or within the publication, as appropriate for the format. For example, websites may declare the encoding in a meta tag and/or in the charset parameter of the Content-Type HTTP header.
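For web content, the declared encoding can be read back from either place so that it can be recorded in the metadata. The following Python sketch uses only the standard library. The function and class names are hypothetical, and the code reports what a page declares rather than detecting the encoding actually used.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_header(content_type: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


class _MetaCharsetParser(HTMLParser):
    """Record the first encoding declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: <meta http-equiv="Content-Type" content="...; charset=...">
            self.charset = charset_from_header(attributes.get("content") or "")


def charset_from_html(html: str):
    """Return the encoding declared in the HTML meta tags, or None."""
    parser = _MetaCharsetParser()
    parser.feed(html)
    return parser.charset
```

If the header and the meta tag disagree, that mismatch is itself worth noting in the documentation, since it affects how the page will be rendered later.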
Do not send administrative data to a preservation archive unless it is integral to the work. For example, when exporting a SQL database, you may need to exclude or anonymize the content from user tables, indexes that support a specific UI, non-public communications, or logs. Only archive the essential data that can be made publicly accessible.
These guidelines refer to the creation of the installation package:
61. Create installation packages for custom websites that don’t require a live server
62. Create installation packages for custom websites that do require a live server
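The database example in this guideline can be sketched with an allowlist: only tables that are known to hold public content are copied into the export, so user tables and logs are left out by default. The sketch below uses SQLite through Python's standard library. The table names and the allowlist are hypothetical, and a real platform would need to review its own schema, including columns within otherwise public tables.

```python
import sqlite3

# Hypothetical allowlist of tables that contain public content only.
PUBLIC_TABLES = ["articles"]


def export_public_tables(source, target, tables):
    """Copy the schema and rows of the allowlisted tables into the target database."""
    for table in tables:
        row = source.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if row is None:
            raise ValueError(f"table not found: {table}")
        target.execute(row[0])  # recreate the table from its original definition
        rows = source.execute(f'SELECT * FROM "{table}"').fetchall()
        if rows:
            placeholders = ", ".join("?" for _ in rows[0])
            target.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', rows
            )
    target.commit()


# Small in-memory demonstration with one public and one administrative table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
source.execute("INSERT INTO articles VALUES (1, 'An Example Article')")
source.execute("INSERT INTO users VALUES (1, 'reader@example.org')")

export = sqlite3.connect(":memory:")
export_public_tables(source, export, PUBLIC_TABLES)
exported_tables = [
    name for (name,) in export.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
]
```

An allowlist is used here instead of a blocklist so that any table added to the platform later is excluded from exports until someone decides it is safe to include.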
For data, software, or any resource that has a complex arrangement of files, if structured metadata cannot be supplied, a common convention is to include a README file from the author. Written in a plain text file format, this should be a note to future users who wish to use the files. It should include information such as scope, purpose, author(s), relevant dates, license for reuse, dependencies, field names and descriptions, and instructions for use.
See also:
68. Provide documentation for software
Consider what a future user might need to know to run the software and understand how it should work, and ensure this is covered by the documentation. For example, what is the software for? What are the supported operating systems and versions? Are there any dependencies or requirements? How do you install it? How do you use it? What should it do if it is working? What is its license? Where the software itself cannot be preserved, visual and narrative documentation of the user experience can provide vital context.
This guideline refers to another common method for documenting software:
66. Use a README file to document data or software